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Abstract 

Link prediction in complex networks has attracted increasing attention from both 
physical and computer science communities. The algorithms can be used to ex- 
tract missing information, identify spurious interactions, evaluate network evolving 
mechanisms, and so on. This article summaries recent progress about link predic- 
tion algorithms, emphasizing on the contributions from physical perspectives and 
approaches, such as the random-walk-based methods and the maximum likelihood 
methods. We also introduce three typical applications: reconstruction of networks, 
evaluation of network evolving mechanism and classification of partially labelled 
networks. Finally, we introduce some applications and outline future challenges of 
link prediction algorithms. 
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1 Introduction 



Many social, biological, and information systems can be well described by 
networks, where nodes represent individuals, and links denote the relations 
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or interactions between nodes. The study of complex networks has therefore 
become a common focus of many branches of science. Great efforts have been 
made to understand the evolution of networks [1,2], the relations between 
topologies and functions [3,4], and the network characteristics [5]. An impor- 
tant scientific issue relevant to network analysis is the so-called information 
retrieval [6,7], which aims at finding material of an unstructured nature that 
satisfies an information need from large collections [8]. It can also be viewed 
as prediction of relations between words and documents and is now further 
extended to stand for a number of problems on link mining, wherein link 
prediction is the most fundamental problem that attempts to estimate the 
likelihood of the existence of a link between two nodes, based on observed 
links and the attributes of nodes [9]. 

In many biological networks, such as food webs, protein-protein interaction 
networks and metabolic networks, whether a link between two nodes exists 
must be demonstrated by field and/or laboratorial experiments, which are 
usually very costly. Our knowledge of these networks is very limited, for ex- 
ample, 80% of the molecular interactions in cells of Yeast [10] and 99.7% of 
human [11,12] are still unknown. Instead of blindly checking all possible in- 
teractions, to predict based on known interactions and focus on those links 
most likely to exist can sharply reduce the experimental costs if the predic- 
tions are accurate enough. Social network analysis also comes up against the 
missing data problem [13,14,15], where link prediction algorithms may play 
a role. In addition, the data in constructing biological and social networks 
may contain inaccurate information, resulting in spurious links [16,17]. Link 
prediction algorithms can be applied in identifying these spurious links [18]. 

Besides helping in analyzing networks with missing data, the link prediction 
algorithms can be used to predict the links that may appear in the future of 
evolving networks. For example, in online social networks, very likely but not 
yet existent links can be recommended as promising friendships, which can 
help users in finding new friends and thus enhance their loyalties to the web 
sites. Similar techniques can be applied to evaluate the evolving mechanism for 
given networks. For example, many evolving models for the Internet topology 
have been proposed: some more accurately reproduce the degree distribution 
and the disassortative mixing pattern [19], some better characterize the fc-core 
structure [20], and so on. Since there are too many topological features and 
it is very hard to put weights on them, we are not easy to judge which model 
(i.e., which evolving mechanism) is better than the others. Notice that, each 
model in principle corresponds to a link prediction algorithm, and thus we can 
use the metrics on prediction accuracy to evaluate the performance of different 
models. 

Link prediction problem is a long-standing challenge in modern information 
science, and a lot of algorithms based on Markov chains and statistical mod- 
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els have been proposed by computer science community. However, their works 
have not caught up the current progress of the study of complex networks, 
especially, they lack serious consideration of the structural characteristics of 
networks, like the hierarchical organization [21] and community structure [22], 
which may indeed provide useful information and insights for link prediction. 
Recently, some physical approaches, such as random walk processes and max- 
imum likelihood methods, have found applications in link prediction. This 
article will give detailed discussion on these new development. 

This article is organized as follows. In the next section, we will present the 
link prediction problem and the standard metrics for performance evaluation. 
Our tour of link prediction algorithms starts with the mainstreaming class of 
algorithms, the so-called similarity-based algorithms r~l. which are further clas- 
sified into three categories according to the information used by the similarity 
indices: local indices, global indices and quasi-local indices. In Section 4 and 
Section 5, we introduce the maximum likelihood algorithms and probabilistic 
models for link prediction. The applications of link prediction algorithms are 
presented in Section 6, including the reconstruction of networks, the evalua- 
tion of network evolving mechanism and the classification of partially labeled 
networks. Finally, we outline some future challenges of link prediction algo- 
rithms. 



2 Problem Description and Evaluation Metrics 

Consider an undirected network G(V, E), where V is the set of nodes and E is 
the set of links. Multiple links and self-connections are not allowed. Denote by 
U the universal set containing all ij^HIXKU possible links, where |V| denotes 
the number of elements in set V. Then, the set of nonexistent links is U — E. 
We assume there are some missing links (or the links that will appear in the 
future) in the set U — E, and the task of link prediction is to find out these 
links. 

Generally, we do not know which links are the missing or future links, otherwise 
we do not need to do prediction. Therefore, to test the algorithm's accuracy, 
the observed links, E, is randomly divided into two parts: the training set, E T , 
is treated as known information, while the probe set (i.e., validation subset), 
E p , is used for testing and no information in this set is allowed to be used for 
prediction. Clearly, E T U E p = E and E T D E p = 0. The advantage of this 
random sub-sampling validation is that the proportion of the training split 
is not dependent on the number of iterations. But with this method some 



The similarity indices between nodes are also called kernels on graphs in some 
literature of computer science community [23] 
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links may never be selected in the probe set, whereas others may be selected 
more than once, resulting in statistical bias. This limitation can be overcame 
by using the K-fold cross-validation, in which the observed links is randomly 
partitioned into K subsets. Each time one subset is selected as probe set, the 
rest K — 1 constitute the training set. The cross-validation process is then 
repeated K times, with each of the K subsets used exactly once as the probe 
set. With this method, all links are used for both training and validation, 
and each link is used for prediction exactly once. Clearly, a larger K will lead 
to smaller statistical bias yet require more computation. Some experimental 
evidences suggested that the 10- fold cross-validation is a very good tradeoff 
between cost and performance [24,25]. An extreme case called leave-one-out 
method (i.e., the | V|-flod cross-validation) will be applied in Section 6.2. 

Two standard metrics are used to quantify the accuracy of prediction algo- 
rithms: area under the receiver operating characteristic curve (AUC) [26] and 
Precision [27,28]. In principle, a link prediction algorithm provides an or- 
dered list of all non-observed links (i.e., U — E T ) or equivalently gives each 
non-observed link, say (x,y) G U — E T , a score s xy to quantify its existence 
likelihood. The AUC evaluates the algorithm's performance according to the 
whole list while the Precision only focuses on the L links with the top ranks or 
the highest scores. A detailed introduction of these two metrics is as follows. 

(i) AUC. — Provided the rank of all non-observed links, the AUC value can 
be interpreted as the probability that a randomly chosen missing link (i.e., a 
link in E p ) is given a higher score than a randomly chosen nonexistent link 
(i.e., a link in U — E). In the algorithmic implementation, we usually calculate 
the score of each non-observed link instead of giving the ordered list since 
the latter task is more time-consuming. Then, at each time we randomly pick 
a missing link and a nonexistent link to compare their scores, if among n 
independent comparisons, there are n' times the missing link having a higher 
score and n" times they have the same score, the AUC value is 



If all the scores are generated from an independent and identical distribu- 
tion, the AUC value should be about 0.5. Therefore, the degree to which the 
value exceeds 0.5 indicates how much better the algorithm performs than pure 
chance. 

(ii) Precision. — Given the ranking of the non-observed links, the Precision is 
defined as the ratio of relevant items selected to the number of items selected. 
That is to say, if we take the top-L links as the predicted ones, among which L r 
links are right (i.e., there are L r links in the probe set E p ), then the Precision 
equals L r /L. Clearly, higher precision means higher prediction accuracy. 



AUC = 



n' + 0.5?i' 




n 
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Fig. 1. An illustration about the calculation of AUC and Precision. 

Figure 1 shows an example of how to calculate the AUC and Precision. In this 
simple graph, there are five nodes, seven existent links and three nonexistent 
links ((1,2), (1,4) and (3,4)). To test the algorithm's accuracy, we need to 
select some existent links as probe links. For instance, we pick (1, 3) and (4, 5) 
as probe links, which are presented by dash lines in the right plot. Then, any 
algorithm can only make use of the information contained in the training graph 
(presented by solid lines in the right plot). If an algorithm assigns scores of 
all non-observed links as si 2 = 0.4, s 13 = 0.5, s u = 0.6, s 34 = 0.5 and 
S45 = 0.6. To calculate AUC, we need to compare the scores of a probe link 
and a nonexistent link. There are in total six pairs: S13 > S12, S13 < S14, 
s i3 = S34, S45 > S12, S45 = S14 and S45 > S34. Hence, the AUC value equals 
(3 x 1 + 2 x 0.5)/6 « 0.67. For Precision, if L = 2, the predicted links are (1, 4) 
and (4, 5). Clearly, the former is wrong while the latter is right, and thus the 
Precision equals 0.5. 



3 Similarity-Based Algorithms 

The simplest framework of link prediction methods is the similarity-based al- 
gorithm, where each pair of nodes, x and y, is assigned a score s xy , which 
is directly defined as the similarity (or called proximity in the literature) be- 
tween x and y. All non-observed links are ranked according to their scores, 
and the links connecting more similar nodes are supposed to be of higher ex- 
istence likelihoods. In despite of its simplicity, the study on similarity-based 
algorithms is the mainstream issue. In fact, the definition of node similarity is 
a nontrivial challenge. Similarity index can be very simple or very complicated 
and it may work well for some networks while fails for some others. In addi- 
tion, the similarities can be used in a more skilled way, such as being locally 
integrated under the collaborative filtering^] framework [30] . 

2 Collaborative filtering is the process of filtering for information or patterns using 
techniques involving collaboration among multiple agents, viewpoints, data sources, 
etc. [29] 
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Node similarity can be defined by using the essential attributes of nodes: 
two nodes are considered to be similar if they have many common features 
[31]. However, the attributes of nodes are generally hidden, and thus we fo- 
cus on another group of similarity indices, called structural similarity, which 
is based solely on the network structure. The structural similarity indices 
can be classified in various ways, such as local vs. global, parameter-free vs. 
parameter-dependent, node-dependent vs. path-dependent, and so on. The 
similarity indices can also be sophisticatedly classified as structural equiva- 
lence and regular equivalence. The former embodies a latent assumption that 
the link itself indicated a similarity between two endpoints (see, for example, 
the Leicht- Holme- Newman index [32] and transferring similarity [33]), while 
the latter assumes that two nodes are similar if their neighbors are similar. 
Readers are encouraged to see Ref. [34] for the mathematical definition of 
regular equivalence and Ref. [35] for a recent application on the prediction of 
protein functions. 

Here we adopt the simplest method, where 20 similarity indices are classified 
into three categories: the former 10 are local indices, followed by 7 global 
indices, and the last 3 are quasi-local indices, which do not require global 
topological information but make use of more information than local indices. 



3. 1 Local Similarity Indices 

(1) Common Neighbours (CN). For a node x, let denote the set of neigh- 
bors of x. In common sense, two nodes, x and y, are more likely to have 
a link if they have many common neighbors. The simplest measure of this 
neighborhood overlap is the directed count, namely 



where \Q\ is the cardinality of the set Q. It is obvious that s xy = (A 2 ) xy , 
where A is the adjacency matrix: A xy — 1 if x and y are directly connected 
and A xy = otherwise. Note that, (A 2 ) xy is also the number of different paths 
with length 2 connecting x and y. Newman [36] used this quantity in the study 
of collaboration networks, showing a positive correlation between the number 
of common neighbors and the probability that two scientists will collaborate 
in the future. Kossinets and Watts [14] analyzed a large-scale social network, 
suggesting that two students having many mutual friends are very probable 
to be friend in the future. 




(2) 



6 



(2) Salton Index [6]. It is defined as 



Saltan I - \"" s ' ' - \& 1 1 /o\ 

h xy ~ r — 5 y<J) 



\r(x)nr( y )\ 



X }\,y 



where k x is the degree of node x. The Salton index is also called the cosine 
similarity in the literature. 

(3) Jaccard Index [37]. This index was proposed by Jaccard over a hundred 
years ago, and is defined as: 



Jaccard = \^(x)nT(y)\ 

"|r(x)ur( y )r [) 



(4) S0rensen Index [38]. This index is used mainly for ecological community 
data, and is defined as 



s S rensen = g£0^l£M! /qn 
rv x ^ rhy 



(5) Hub Promoted Index (HPI) [39]. This index is proposed for quantifying 
the topological overlap of pairs of substrates in metabolic networks, and is 
defined as 



\r(x)nv( y )\ 



xy mm{k x , k y } 



Under this measurement, the links adjacent to hubs are likely to be assigned 
high scores since the denominator is determined by the lower degree only. 

(6) Hub Depressed Index (HDI). Analogously to the above index, we also 
consider a measurement with the opposite effect on hubs, defined as 

hdi = |r(s)nr(y)| 

" max{k x ,k v }- [) 



(7) Leicht-Holme-Newman Index (LHN1) [32]. This index assigns high simi- 
larity to node pairs that have many common neighbors compared not to the 
possible maximum, but to the expected number of such neighbors. It is defined 

as 



\T(x)nT{y)\ 



x y h v h ' y ' 

ft,™ A ft,,. 
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where the denominator, k x x k y , is proportional to the expected number of 
common neighbors of nodes x and y in the configuration model [40]. We use 
the abbreviation LHN1 to distinguish this index to another index (named as 
LHN2 index) also proposed by Leicht, Holme and Newman. 

(8) Preferential Attachment Index (PA). The mechanism of preferential at- 
tachment can be used to generate evolving scale-free networks, where the 
probability that a new link is connected to the node x is proportional to k x 
[41]. A similar mechanism can also lead to scale- free networks without growth 
[42], where at each time step, an old link is removed and a new link is gener- 
ated. The probability that this new link will connect x and y is proportional 
to k x x k y . Motivated by this mechanism, the corresponding similarity index 
can be defined as 

S xy = k x ~x k y , (9) 



which has been widely used to quantify the functional significance of links 
subject to various network-based dynamics, such as percolation [43], synchro- 
nization [44] and transportation [45]. Note that this index does not require 
the information of the neighborhood of each node, as a consequence, it has 
the least computational complexity. 

(9) Adamic-Adar Index (AA) [46]. This index refines the simple counting of 
common neighbors by assigning the less-connected neighbors more weight, and 
is defined as 

<= E rV ( 10 ) 

zer(x)nr( y ) io & Kz 



(10) Resource Allocation Index (RA) [47]. This index is motivated by the 
resource allocation dynamics on complex networks [48]. Consider a pair of 
nodes, x and y, which are not directly connected. The node x can send some 
resource to y, with their common neighbors playing the role of transmitters. 
In the simplest case, we assume that each transmitter has a unit of resource, 
and will equally distribute it to all its neighbors. The similarity between x and 
y can be defined as the amount of resource y received from x, which is: 

4?= E jp (ii) 



Clearly, this measure is symmetric, namely s xy = s yx . Note that, although 
resulting from different motivations, the AA index and RA index have very 
similar form. Indeed, they both depress the contribution of the high-degree 
common neighbors. AA index takes the form (log/c z ) _1 while RA index takes 
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Table 1 

Accuracies of different local similarity indices subject to link prediction, measured 
by the AUC value. Each number is obtained by averaging over 10 implementations 
with independently random partitions of testing set (90%) and probe set (10%). 
The entries corresponding to the highest accuracies among these 10 indices are em- 
phasized in black. The six real networks for testing are a protein-protein interaction 
network (PPI) [16], a co-authorship network of scientists who are themselves pub- 
lishing on the topic of network science (NS) [50], an electrical power- grid of the 
western US (Grid) [51], a network of the US political blogs (PB) [52], a router-level 
Internet collected by Rocketfuel Project (INT) [53], and a network of the US air 
transportation system (US Air) [54]. Detailed information about these networks can 
be found in Ref. [47]. 



Indices 


PPI 


NS 


Grid 


PB 


INT 


USAir 


CN 


0.889 


0.933 


0.590 


0.925 


0.559 


0.937 


Salton 


0.869 


0.911 


0.585 


0.874 


0.552 


0.898 


Jaccard 


0.888 


0.933 


0.590 


0.882 


0.559 


0.901 


S0rensen 


0.888 


0.933 


0.590 


0.881 


0.559 


0.902 


HPI 


0.868 


0.911 


0.585 


0.852 


0.552 


0.857 


HDI 


0.888 


0.933 


0.590 


0.877 


0.559 


0.895 


LHN1 


0.866 


0.911 


0.585 


0.772 


0.552 


0.758 


PA 


0.828 


0.623 


0.446 


0.907 


0.464 


0.886 


AA 


0.888 


0.932 


0.590 


0.922 


0.559 


0.925 


RA 


0.890 


0.933 


0.590 


0.931 


0.559 


0.955 



the form k^ 1 . The difference is insignificant when the degree, k z , is small, 
while it is considerable when k z is large. In a word, RA index punishes the 
high-degree common neighbors more heavily than AA. 

Liben-Nowell et al. [49] and Zhou et al. [47] systematically compared a number 
of local similarity indices on many real networks: the former [49] focuses on 
social collaboration networks and the latter [47] considers disparate networks 
including the protein-protein interaction network, electronic grid, Internet, 
US airport network, etc. According to extensive experimental results on real 
networks (see results in Table 1), the RA index performs best, while AA and 
CN indices have the second best overall performance among all the above- 
mentioned local indices. 

The PA index has the worst overall performance, yet we are interested in 
it for it requires the least information. Notice that, PA performs even worst 
than pure chance for the Internet at router level and the power grid. In these 
two networks, the nodes have well-defined positions and the links are physi- 
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cal lines. Actually, geography plays a significant role and links with very long 
geographical distances are rare. As local centers, the high-degree nodes have 
longer geographical distances to each other than average, and thus have a 
lower probability of directly connecting to each other, which leads to the bad 
performance of PA. In contrast, although USAir has well-defined geographical 
positions of nodes, its links are not physical. Empirical data has demonstrated 
that the number of airline flights is not very sensitive to the geographical 
distance within a big range [55,56] (another topological evidence for the rela- 
tively good performance of PA on USAir is the so-called rich-club phenomenon 
[57,58]). The LHN1 index performs the second worst, however, compared with 
all other neighborhood-based indices, it is very good at uncovering the missing 
links connecting two small-degree nodes [59]. 

Recently, Pan et al. [60] have compared all the local indices appeared in Ref. 
[47] in a similarity-based community detection algorithm, and their experi- 
mental results again indicate that the RA index performs best. Wang et al. 
[61] have applied the RA index to estimate the weights between stations in 
Chinese railway, which shows better performance than the CN index. In ad- 
dition, the RA index for bipartite networks can be applied in personalized 
recommendation with higher accuracy than the classical collaborative filter- 
ing [62]. 



3.2 Global Similarity Indices 

(11) Katz Index [63]. This index is based on the ensemble of all paths, which 
directly sums over the collection of paths and is exponentially damped by 
length to give the shorter paths more weights. The mathematical expression 
reads 

oo 

s« y atz = £0' • \paths<l>\ = (3A xy + (3 2 (A 2 ) xy + (3 3 (A% y + • ■ ■ , (12) 
i=i 

where paths^ is the set of all paths with length I connecting x and y, and 
(3 is a free parameter (i.e., the damping factor) controlling the path weights. 
Obviously, a very small j3 yields a measurement close to CN, because the long 
paths contribute very little. The similarity matrix can be written as 

gKatz = y_ p A yl _ j ( 13 ) 

Note that, /3 must be lower than the reciprocal of the largest eigenvalue of 
matrix A to ensure the convergence of Eq. 12. 
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(12) Leicht- Holme- Newman Index (LHN2) [32]. This index is a variant of the 
Katz index. Based on the concept that two nodes are similar if their imme- 
diate neighbors are themselves similar, one obtains a self-consistent matrix 
formulation 

S = <PAS + tpi = - (PA)- 1 = + <f)A + <p 2 A 2 + ■••), (14) 



where <fi and ip are free parameters controlling the balance between the two 
components of the similarity. Setting ?/> = 1, it is very similar to the Katz 
index. Note that {A l ) xy is equal to the number of paths of length I from x 
to y. The expected value of (A l ) xy , namely E[(A l ) xy ], equals (k x k y /2M)\i , 
where Ai is the largest eigenvalue of A and M is the total number of edges 
in the network. Replace (A l ) xy in Eq. 14 with (A 1 ) xy / E[(A l ) xy ], we obtain the 
expression: 



^LHN2 
'xy 



$xy ~\~ 



2M ~ 



4> l X l ~\A l ) 



■>-y 



2M\ l 



2M\ l 



where 5 xy is the Kronecker's function. Since the first item is a diagonal matrix, 
it can be dropped and thus we arrive to a compact expression 



S = 2m\ 1 D~ 1 (I Y 1 D~\ (16) 
Ai 



where D is the degree matrix with D xy = 5 xy k x and <p (0 < <ft < 1) is a 
free parameter. The choosing of <fi depends on the investigated network, and 
smaller <fi assigns more weights on shorter paths. 

(13) Average Commute Time (ACT). Denote by m(x,y) the average number 
of steps required by a random walker starting from node x to reach node y, 
the average commute time between x and y is 

n(x,y) = m(x,y) + m(y,x), (17) 



which can be obtained in terms of the pseudoinverse of the Laplacian matrix, 
L+ (L = D-A), as [64,65]: 

n{x,y) = M{lt x + Z+ -2Z+), (18) 



where l xy denotes the corresponding entry in L + . Assuming two nodes are 
more similar if they have a smaller average commute time, then the similarity 



between the nodes x and y can be denned as the reciprocal of n(x, y), namely 
(the constant factor M is removed) 



1 



<° T = ,+ ■ J ^ • (19) 



x v /++/+_ 21+ 

xx 1 "yy xy 



(14) Cosine based on L + . This index is an inner-product-based measure. In the 
Euclidean space spanned by v x = K^U T e~ X) where U is an orthonormal matrix 
made of the eigenvectors of L + ordered in decreasing order of corresponding 
eigenvalue X x , A = diag(A x ), is an N X 1 vector with the xth element 
equal to 1 and others all equal to 0, and T is the matrix transposition, the 
pseudoinverse of the Laplacian matrix are the inner products of the node 
vectors, = v£v y . Accordingly, the cosine similarity is defined as the cosine 
of the node vectors, namely [65]: 



vEv„ It 



8 %> + = cos(x, y) + = , 77 , = . xy . (20) 



\ V v\ Stx-Vyy 



(15) Random Walk with Restart (RWR). This index is a direct application of 
the PageRank algorithm [66]. Consider a random walker starting from node x, 
who will iteratively moves to a random neighbor with probability c and return 
to node x with probability 1 — c. Denote by q xy the probability this random 
walker locates at node y in the steady state, we have 

q x = cP T q x + (1 - c)e x , (21) 



where P is the transition matrix with P xy — l/k x if x and y are connected, 
and P xy = otherwise. The solution is straightforward, as 

q- x = (l-c)(I-cP T )- 1 e-;. (22) 



The RWR index is thus defined as 

S xy = Qxy + Qyx- (23) 



A fast algorithm to calculate this index was proposed by Tong et al. [67], and 
the application of this index to recommender systems can be found in Ref. 
[68]. 

(16) SimRank [69]. Similar to the LHN2, SimRank is defined in a self-consistent 
way, according to the assumption that two nodes are similar if they are con- 
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nected to similar nodes: 



Ev~^ „SimRank 
ru x rvy 



where s xx = 1 and C G [0, 1] is the decay factor. The SimRank can also be 
interpreted by the random-walk process, that is, s^ nRank measures how soon 
two random walkers, respectively starting from nodes x and y, are expected 
to meet at a certain node. 

(17) Matrix Forest Index (MFI) [70]. This index is defined as 

S = (I + L)-\ (25) 



where the similarity between x and y can be understood as the ratio of the 
number of spanning rooted forests such that nodes x and y belong to the same 
tree rooted at x to all spanning rooted forests of the network (see details in 
Ref. [70]). A parameter-dependent variant of MFI is 

S = {I + aL)-\ a>0. (26) 



This index has been applied to quantify the similarity between nodes on collab- 
orative recommendation task [71]. The results indicate that a simple nearest- 
neighbors rule based on similarity measured by MFI performs best. 

Comparing with the local similarity indices, the global ones ask for the whole 
topological information. Although the global indices can provide much more 
accurate prediction than the local ones, they suffer two big disadvantages: 
(i) the calculation of a global index is very time-consuming, and is usually 
infeasible for large-scale networks; (ii) sometimes, the global topological infor- 
mation is not available, especially if we would like to implement the algorithm 
in a decentralized manner. As we will show in the next subsection, a promising 
tradeoff is the quasi-local indices, which considers more information than local 
indices while abandons the superfluous information that makes no contribu- 
tion or very little contribution to the prediction accuracy. 



3.3 Quasi-Local Indices 



(18) Local Path Index (LP) [47,72]. To provide a good tradeoff of accuracy and 
computational complexity, we here introduce an index that takes consideration 
of local paths, with wider horizon than CN. It is defined as 

S LP = A 2 + eA 3 , (27) 
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where e is a free parameter. Clearly, this measure degenerates to CN when 
e = 0. And if x and y are not directly connected (this is the case we are 
interested in), {A 3 ) xy is equal to the number of different paths with length 3 
connecting x and y. This index can be extended to account for higher-order 
paths, as 



where n > 2 is the maximal order. With the increasing of n, this index asks 
for more information and computation. Especially, when n — > oo, S LP(jl ^ will 
be equivalent to the Katz index that takes into account all paths in the net- 
work. The computational complexity of this index in an uncorrelated network 
is Q(N (k) n ) , which grows fast with the increasing of n and will exceed the 
complexity for calculating the Katz index (approximate to Q(iV 3 )) for large 
n. Experimental results show that the optimal n is positively correlated with 
the average shortest distance of the network [72]. 

The LP index performs remarkably better than the neighborhood-based in- 
dices, such as RA, AA and CN [47] . It is because the neighborhood information 
is less distinguishable and two node pairs are of high probability to be assigned 
the same similarity scores. Taking INT as an example, there are more than 10 7 
node pairs, 99.59% of which are assigned zero score by CN. For all the node 
pairs having scores higher than 0, 91.11% are assigned score 1, and 4.48% 
are assigned score 2. Using a little bit more information involving the next 
nearest neighbors may break the "degeneracy of the states and make the simi- 
larity scores more distinguishable. This is the reason why the LP index largely 
improves the prediction accuracy. 

The comparison of LP index with other two path-dependent global indices, the 
Katz and LHN2 indices, is shown in Table 2. Overall speaking, the Katz index 
performs best subject to the AUC value, while the LP index is the best for the 
Precision. For the network with small average shortest distance (e.g., US Air 
and PB), LP index gives the most accurate predictions for both AUC and 
Precision. In a word, the LP index provides competitively good predictions 
while asks for much lighter computation compared with the global indices. 

(19) Local Random Walk (LRW) [73]. To measure the similarity between nodes 
x and y, a random walker is initially put on node x and thus the initial density 
vector n x (0) = e x . This density vector evolves as n x (t + l) = P T n x (t) for t > 0. 
The LRW index at time step t is thus defined as 



where q is the initial configuration function. In Ref. [73] Liu and Lii applied a 
simple form determined by node degree, namely q x = ff . 



S LP(n) =A 2 + eA 3 + e 2 A 4 + ... + e ^ A n , 



(28) 



S xy (<) = QxTTxy{t) + q y K yx {t). 



(29) 
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Table 2 

Accuracies of the three path-dependent similarity indices, measured by AUC and 
precision. Here only the main components of example networks are considered (see 
Ref. [72] for detailed information). Each number is obtained by averaging over 10 
independent realizations. The entries corresponding to the highest accuracies are 
emphasized in black. For LP, Katz and LHN2 indices, the AUC values are corre- 
sponding to the optimal parameter which will be used to calculate their correspond- 
ing precision where we set L = 100. For USAir, the optimal value of e is negative 
(see the explanation in Ref. [47]). LP* denotes the LP index with a fixed parameter 
e = 0.01 (for USAir e = —0.01). The very small difference between the optimal case 
and the case with e = 0.01 suggests that in the real application, one can directly 
set e as a very small number, instead of finding out its optimum that may cost 
much time. This again supports our motivation that the essential advantage of the 
uasge of the second order neighborhood is to improve the distinguishability of the 
similarity scores. 



AUC 


PPI 


NS 


Grid 


PB 


INT 


USAir 


LP 


0.970 


0.988 


0.697 


0.941 


0.943 


0.960 


LP* 


0.970 


0.988 


0.697 


0.939 


0.941 


0.959 


Katz 


0.972 


0.988 


0.952 


0.936 


0.975 


0.956 


LHN2 


0.968 


0.986 


0.947 


0.769 


0.959 


0.778 


Precision 


PPI 


NS 


Grid 


PB 


INT 


USAir 


LP 


0.734 


0.292 


0.132 


0.519 


0.557 


0.627 


LP* 


0.734 


0.292 


0.132 


0.469 


0.121 


0.627 


Katz 


0.719 


0.290 


0.063 


0.456 


0.368 


0.623 


LHN2 





0.060 


0.005 








0.005 



(20) Superposed Random Walk (SRW) [73]. Similar to the RWR index, Liu and 
Lii [73] proposed the SRW index, where the random walker is continuously 
released at the starting point, resulting in a higher similarity between the 
target node and the nodes nearby. The mathematical expression reads 

= E s *f*» = E + w*(t)]> (3°) 

r=l T=l 

where t denotes the time steps. 

Liu and Lii [73] systematically compared these two indices, LRW and SRW, 
with five other indices, including three local (or quasi-local) indices, CN, RA 
and LP, and two other random-walk-based global indices, ACT and RWR, as 
well as the hierarchical structure method (HSM) proposed by Clauset, Moore 
and Newman [75] (see Section 4.1 for the detailed introduction of HSM). 



15 



Table 3 

Comparison of algorithms' accuracy quantified by AUC and Precision. For each 
network, the training set contains 90% of the known links. Each number is obtained 
by averaging over 1000 implementations with independently random divisions of 
training set and probe set. The parameters e = 10~ 3 for LP (for US Air, e = — 10 -3 ) 
and c = 0.9 for RWR. The numbers inside the brackets denote the optimal step of 
LRW and SRW indices. For example, 0.972(2) means the optimal AUC is obtained 
at the second step of LRW. The highest accuracy in each line is emphasized in black. 
For HSM, 5000 samples of dendrograms for each implementation are generated. 



AUC 


CN 


RA 


LP 


ACT 


RWR 


HSM 


LRW 


SRW 


USAir 


0.954 


0.972 


0.952 


0.901 


0.977 


0.904 


0.972(2) 


0.978(3) 


NetScience 


0.978 


0.983 


0.986 


0.934 


0.993 


0.930 


0.989(4) 


0.992(3) 


Power 


0.626 


0.626 


0.697 


0.895 


0.760 


0.503 


0.953(16) 


0.963(16) 


Yeast 


0.915 


0.916 


0.970 


0.900 


0.978 


0.672 


0.974(7) 


0.980(8) 


C.elegans 


0.849 


0.871 


0.867 


0.747 


0.889 


0.808 


0.899(3) 


0.906(3) 


Precision 


CN 


RA 


LP 


ACT 


RWR 


HSM 


LRW 


SRW 


USAir 


0.59 


0.64 


0.61 


0.49 


0.65 


0.28 


0.64(3) 


0.67(3) 


NetScience 


0.26 


0.54 


0.30 


0.19 


0.55 


0.25 


0.54(2) 


0.54(2) 


Power 


0.11 


0.08 


0.13 


0.08 


0.09 


0.00 


0.08(2) 


0.11(3) 


Yeast 


0.67 


0.49 


0.68 


0.57 


0.52 


0.84 


0.86(3) 


0.73(9) 


C.elegans 


0.12 


0.13 


0.14 


0.07 


0.13 


0.08 


0.14(3) 


0.14(3) 



According to the experimental results (see Table 3), LRW and SRW methods 
perform better than other indices with their respective optimal walking step 
positively correlated with the average shortest distance of the network. 

Furthermore, the computational complexity of LRW and SRW is lower than 
ACT and RWR whose time complexity in calculating inverse and pseudoin- 
verse is approximately <D(N 3 ), while the time complexity of n-steps LRW and 
SRW are approximately Q(N (k) n ) , ignoring degree heterogeneity of the net- 
work. That is to say, when n is small LRW and SRW run much faster than 
other random- walk-based global similarity indices. The advantage of LRW and 
SRW for their low calculation complexity is prominent especially in the huge 
size (i.e. large N) and sparse (i.e. small (k)) networks. For example, LRW or 
SRW for power grid is thousands time faster than ACT, cos + and RWR, even 
for n ~ 10 [73]. 

With the similar motivation of LRW and SRW, Mantrach et al. recently pro- 
posed a bounded normalized random walk with restart algorithm (see Eq. 21 
for the definition of RWR), and applied it to address the classification problem 



1(3 



[74]. With this method both complexities of time and space can be reduced. 



4 Maximum Likelihood Methods 

This section will introduce two recently proposed algorithms based on the 
maximum likelihood estimation. These algorithms presuppose some organiz- 
ing principles of the network structure, with the detailed rules and specific 
parameters obtained by maximizing the likelihood of the observed structure. 
Then, the likelihood of any non-observed link can be calculated according to 
those rules and parameters. 

From the viewpoint of practical applications, an obvious drawback of the max- 
imum likelihood methods is that it is very time consuming. A well designed 
algorithm is able to handle networks with up to a few thousand nodes in a 
reasonable time, but will definitely fail to deal with the huge online networks 
that often consist of millions of nodes. In addition, the maximum likelihood 
methods are probably not among the most accurate ones (see, for example, the 
comparison between hierarchical structure model and some typical similarity- 
based methods in Table 3). However, the maximum likelihood methods provide 
very valuable insights into the network organization, which can not be gained 
from the similarity-based algorithms or the probabilistic models. 

4-1 Hierarchical Structure Model 

Empirical evidence indicates that many real networks are hierarchically orga- 
nized, where nodes can be divided into groups, further subdivided into groups 
of groups, and so forth over multiple scales [21] (e.g., metabolic networks [39] 
and brain networks [76]). As Redner said [77], focusing on the hierarchical 
structure inherent in social and biological networks might provide a smart 
way to find missing links. Clauset, Moore and Newman [75] proposed a gen- 
eral technique to infer the hierarchical organization from network data and 
further applied it to predict the missing links. 

As shown in Fig. 2, the hierarchical structure of a network can be represented 
by a dendrogram with N leaves (corresponding to the nodes of the network) 
and N — 1 internal nodes. Clauset, Moore and Newman [75] introduced a 
simple model where each internal node r is associated with a probability p r 
and the connecting probability of a pair of nodes (leaves) is equal to p r r where 
r' is the lowest common ancestor of these two nodes. Given a real network G 
and a dendrogram D, let E r be the number of edges in G whose endpoints 
have r as their lowest common ancestor in D, and let L r and R r , respectively, 
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Fig. 2. Illustration of a dendrogram of a network with 5 nodes. Accordingly, the 
connecting probability of nodes 1 and 2 is 0.5, of nodes 1 and 3 is 0.3, of nodes 3 
and 4 is 0.4. 

be the number of leaves in the left and right subtrees rooted at r. Then the 
likelihood of the dendrogram D together with a set of p r is 

C(D,{p r }) = n?^(l -Pr)****-*. (31) 

r 

For a fixed D, it is obvious that 

P* = 7~TT (32) 



maximizes C(D, {p r })- Therefore, according to the maximum likelihood method 
[78], with a fixed D, it is easy to determine {p r } (by Eq. 32) that best fits the 
network G. Figure 3 shows an example network and two possible dendrograms, 
as well as the corresponding likelihoods. It is in accordance with the common 
sense that D 2 is more likely. A Markov chain Monte Carlo method is used 
to sample dendrograms with probability proportional to their likelihood (see 
the Supplementary Information of Ref. [75] and a benchmark book [79] for 
details). 

The algorithm to predict the missing links contains the following procedures: 
(i) Sample a large number of dendrograms with probability proportional to 
their likelihood; (ii) For each pair of unconnected nodes % and j, calculate the 
mean connecting probability (p^) by averaging the corresponding probability 
Pij over all sampled dendrograms; (hi) Sort these node pairs in descending 
order of (p^) and the highest-ranked ones are those to be predicted. According 
to the AUC value, this algorithm outperforms the CN index for the terrorist 
association network [80] and the grassland species food web [81], while loses 
for the metabolic network of the spirochaete Treponema Pallidum [82]. 
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Fig. 3. The likelihood of two possible dendrograms for an example network consisting 
of 6 nodes. The interval nodes are labeled with the maximum likelihood probability 
obtained by Eq. 32. The likelihoods are C{D\) ~ 0.00165 (left dendrogram) and 
C{D\) ~ 0.0433 (right dendrogram). Reprinted figure with permission from [75], 
copyright is held by Nature Publishing Group. 

The hierarchical structure model provides a smart way to predict missing links, 
and, maybe more significantly, it uncovers the hidden hierarchical organization 
of networks. However, as mentioned above, a big disadvantage is that this 
algorithm runs very slow. Actually, the process to sample dendrograms usually 
asks for 0(iV 2 ) steps of the Markov chain [75], and in the worst case, it takes 
exponential time [83]. In comparison, according to the CPU of an advanced 
desktop computer, the hierarchical structure model cannot manage a network 
of tens of thousands nodes, while the algorithms based on local similarity 
indices can deal with networks with tens of millions nodes. Another noticeable 
remark is that this model may give poor predictions for those networks without 
clear hierarchical structures. 



4-2 Stochastic Block Model 

Stochastic block model [84,85,86,87] is one of the most general network models, 
where nodes are partitioned into groups and the probability that two nodes are 
connected depends solely on the groups to which they belong. The stochastic 
block model can capture the community structure [22], role-to-role connections 
[88] and maybe other factors for the establishing of connections, especially 
when the group membership plays the considerable roles in determining how 
nodes interact with each other, which usually could not be well described by 
the simple assortativity coefficient [89,90] or the degree-degree correlations 



19 



Fig. 4. An illustration about the calculation of likelihood for the stochastic block 
model. 

[91,92]. 

Given a partition A4 where each node belongs to one group and the connecting 
probability for two nodes respectively in groups a and (3 is denoted by Q a p 
(Q aa represents the probability that two nodes within group a are connected), 
then the likelihood of the observed network structure is [18]: 

C(A\M) = II <2a|(l - Q a ?Y^, (33) 

a</3 



where l a p is the number of edges between nodes in groups a and (3 and r a p is 
the number of pairs of nodes such that one node is in a and the other is in /3. 
Similar to Eq. 32, the optimal Q a p that maximizes the likelihood C(A\M) is: 

Q«e = ^- (34) 



A simple illustration is shown in Fig. 4. Given a partition Jv[ = {{1, 2, 3}, {4, 5, 6, 7}}, 
according to Eq. 34, the Q values corresponding to the maximum likelihood 
are Q\ x = | = 1, Q\ 2 = ^ = \, Q22 = \i an( i thus the likelihood is 

^ ix Q 2 (5) 10x (g) 5 Q) K 3005 xi °" < 35 » 



Denote by Q the set of all possible partitions, using Bayes' Theorem [93], the 
reliability of an individual link is [18]: 

R Ci A \\A\ Jn£(A xy = l\M)C(A\M) P (M)dM 

K xy - L(A xy -l\A)- Jn c(A\M')p(M>)dM> ' {db) 



where p(M) is a constant assuming no prior knowledge about the model. 
Notice that, the number of different partitions of elements grows faster 
than any finite power of N, and thus even for a small network, to sum over 
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all partitions is not possible in practice. The Metropolis algorithm [94] can be 
applied to estimate the link reliability [18]. Even though, the whole process is 
very time consuming and this method can only manage networks with up to 
a few thousands of nodes. 

Reliability describes the likelihood of the existence of a link (i.e., the prob- 
ability that the link "truly" exists) given the observed structure [18], which 
can be used not only to predict missing links (the nonexistent links in the ob- 
served network yet with the highest reliabilities) but also to identify possible 
spurious links (the existent links with the lowest reliabilities). Empirical com- 
parison on five disparate networks (the social interactions in a karate club [95], 
the social network of frequent associations between 62 dolphins [96], the air 
transportation network of Eastern Europe [97], the neural network of the ne- 
matode Caenorhabditis elegans [98], and the metabolic network of Escherichia 
coli [99]) indicated that the overall performance of the maximum likelihood 
method based on stochastic block model [18] is better than the one based on 
the hierarchical structure model [75] and the similarity-based algorithm for 
common neighbors [49]. 



5 Probabilistic Models 

Probabilistic models aim at abstracting the underlying structure from the ob- 
served network, and then predicting the missing links by using the learned 
model. Given a target network G = (V,E), the probabilistic model will opti- 
mize a built target function to establish a model composed of a group of param- 
eters 0, which can best fit the observed data of the target network. Then the 
probability of the existence of a nonexistent link is estimated by the con- 
ditional probability P{Aij = 1|0). This section will introduce the three main- 
stream methods, respectively called Probabilistic Relational Model (PRM) 
[100], Probabilistic Entity Relationship Model (PERM) [101] and Stochastic 
Relational Model (SRM) [102]. Notice that, in some literature, the term PRM 
only refers to a specific model which is usually called the Relational Bayesian 
Networks nowadays, while we adopt the more general usage of PRM in this 
review. 



5.1 Probabilistic Relational Models 

PRMs represent a joint probability distribution over the attributes of a re- 
lational dataset. They allow the properties of an object to dependent prob- 
abilistically both on other properties of that object and on properties of the 
related objects. Different from the traditional graphical models using a single 
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graph to model the relationship among the attributes of homogenous entities, 
PRMs contain three graphs [103]: the data graph Gd, the model graph Gm, 
and the inference graph Gi. These correspond to the skeleton, model, and 
ground graph as outlined by Heckerman et al. [104]. 

The data graph Gd = (Vd,E d ) presents the input network, where nodes 
are the objects in the data and edges represent the relationships among the 
objects. Each node Vi G Vd and edge ej G E D are associated with a type 
T(vi) = t Vi , T(ej) = t e .. Each item (either object or edge) type t G T has a 
number of associated attributes X*. Consequently, each object and link 
ej are associated with a set of attribute values, x v * and Xe- , determined 
by their types, t v and t e , respectively. A PRM represents a joint proba- 
bility distribution over the values of all the attributes in the data graph, 
x = {xvi : Vi G V D ,T(i>i) = t Vi }{J{x e ] 3 : ej G E D ,T{e j ) = t ej }. For example, 
in the student-course selection system, the students and courses are nodes 
and the edges represent the select relationship between students and courses. 
Clearly, there are two types of nodes, namely student and course. And the 
type student has four attributes: grade, age, sex and department, while the 
type course has five attributes: category, teacher, year, time and discipline. 

The model graph Gm = (Vm, E m ) represents the dependencies among at- 
tributes at the level of item types. Attributes of an item can depend proba- 
bilistically on other attributes of the same item, as well as on attributes of 
other related objects or links in Go- Each node in Vm corresponds to an at- 
tribute Xj G X* where t G T. The attributes with the same type in Gd are 
tied together. Thus Gd is decomposed into multiple examples of each type, 
based on which a joint model of dependencies among the type attributes can 
be built. Gm contains two parts: the dependent structure among all the type 
attributes and the conditional probability distribution (CPD) associated with 
the nodes in Gm- 

The inference graph Gj = (V/, Ej) represents the probabilistic dependencies 
among all the variables in a single test set. It can be instantiated by a roll- 
out process of Go and Gm- Each item-attribute pair in Gd gets a separate, 
local copy of the corresponding CPD from Gm- The relations in Gd determine 
the way that Gm is rolled out to form Gj. Therefore the structure of Gj is 
determined by both Go and Gm- 

With respect to different representations of the modeled graph Gm and the 
corresponding learning and inferring procedures, PRMs can be classified into 
three groups: Relational Bayesian Networks (RBNs), Relational Markov Net- 
works (RMNs) and Relational Dependency Networks (RDNs). 

Relational Bayesian Networks (RBNs) [100,105] — The model graph presented 
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by a RBN is a directed acyclic grapnel with a set of CPDs, P, to represent 
a joint distribution over the attributes of the item types. The set P contains 
a conditional probability distribution for each variable given its parental 
p(x\pa x ), where pa x denotes the parents of node. Thus the joint probabilistic 
distribution can be calculated as 

M x ) = n n n pKip^o n pk\p^o- ( 3? ) 

teT XteX* v:T(v)=t e:T(e)=t 



The need to avoid cycles in RBN leads to significant representational and 
computational difficulties. Inference is done by creating the complete ground 
network, which limits their scalability. RBN requires specifying a complete 
conditional model for each attribute of each class, which in large complex 
domains can be quite burdensome. 

Relational Markov Networks (RMNs) [106,107] — A RMN uses an undirected 
graph and a set of potential functions $ to represent the joint distribution over 
the attributes of the item types. Denote by C the set of cliques in the graph 
and each clique c G C is associated with a set of variables X c (g X*) (i.e. the 
nodes in this clique 0) and a clique potential <& c (x c ) which is a non- negative 
function over the possible values for x c G X c , then the joint probability over 
x is calculated with the formula 

pw = \ n *cO&c), (ss) 



where Z is a normalizing constant, which sums over all possible instantiations. 
RMNs are trained discriminatively, and do not specify a complete joint distri- 
bution for the variables in the model. The learning algorithm uses maximum 
a posteriori (MAP) estimation with belief propagation for inference, which 
leads to a high computational complexity for learning. 

Relational Dependency Networks (RDNs) [110,111] — The RDN is a bi-directed 
graphical model with a set of conditional probability distributions, which can 
be used to represent the cyclic dependencies in a relational setting. RDNs use 
pseudo-likelihood learning techniques to estimate an efficient approximation 
of the full joint distribution of the attribute values in a relational dataset. The 
pseudo-likelihood for data graph Gr, is computed as a product over the item 



3 A directed graph is acyclic if there is no directed path that starts and ends at the 
same variable. This constrain indicates that a random variable does not depend, 
directly or indirectly, on its own value. 

4 A direct link from a to b indicates that a is fe's parent node. 

5 Actually a node in a clique corresponds to an attribute in the data graph. 
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types t, the attributes of that type X*, and the nodes v and edges e of that 
type: 

pl(g d ;0) = U n II p«b«^;0) II K<b<H;©)- (39) 

teTXjeX* v.T(v)=t e:T(e)=t 

The CPDs in the RDN pseudo-likelihood are not required to factor the joint 
distribution of Go. More specifically, when consider the variable x*., we con- 
dition on the values of the parents pa x t regardless of whether the estimation 

v i 

of CPDs for variables in pa x t v was conditioned on x\.. RDN adopts Gibbs 
samplin<^\ to iteratively relabel each unobserved variable by drawing from its 
local conditional distribution, given the current state of the rest of the graph. 

5.2 Probabilistic Entity Relationship Models 

A specific type of probabilistic entity-relationship model is the directed acyclic 
PERM (DAPER for short), which uses directed arc£0 to describe the rela- 
tionship between attributes [101]. DAPER makes relationships the first class 
objects in the modeling language, and encourages an explicit representation 
of conditional probabilistic distribution. The DAPER model consists of six 
classes [113]: (i) Entity classes: specify the classes of objects in real world; (ii) 
Relationship classes: represent the interaction among entity classes; (iii) At- 
tribute classes: describe properties of entities or relationships, (iv) Arc classes: 
represent probabilistic dependencies among corresponding attributes; (v) Lo- 
cal distribution classes: construct the local distributions for attributes corre- 
sponding to the attribute class; (vi) Constraint classes: specify how to derive 
inference graph (i.e., ground graph) from the corresponding DAPER model 
over the given instantiated domain. DAPER model assigns relationships the 
same importance as the entities. 

The DAPER model can be used in the situation where the relational struc- 
ture itself is uncertain. And it is more expressive than either PRMs or plate 
models\^\ Actually, DAPER combines the features of plate models and PRMs, 



6 Heckerman et al. [110] proved that the Gibbs sampling can be used to estimate the 
joint distribution of a dependency network. For a basic introduction and summary 
of the Gibbs sampling, see Ref. [112]. 

7 In graph theory, the term "arc" stands for directed link. 

8 The standard description of plate models can be found in Refs. [108,109]. Hecker- 
man et al. [101] provided a new definition of plate model, which is slightly different 
from the traditional one [108,109]. According to this new definition, the plate mod- 
els and DAPER models are equivalent, and a plate model can be invertible mapped 
to a DAPER model [101]. 
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and the relations between DAPER models, PRMs and plate models can be 
found in Ref. [101]. 



5.3 Stochastic Relational Models 



The key idea of SRM is to model the stochastic structure of entity relationships 
(i.e., links) via a tensor interaction of multiple Gaussian Processes (GPs), each 
defined on one type of entities [102]. 

Assuming that the observable links r are derived as local measurements of a 
real-value latent relation function t: U x V — > M, and each link is solely de- 
pendent on its latent value t^, modeled by the likelihood p(rij\tij). The latent 
relation function t is generated via a tensor interaction of two independent 
entity-specific GPs, one acting on U and the other on V. Note that U and V 
can both encompass an infinite number of entities. The relational processes are 
characterized by hyper-parameters © = {Os)@n}j where Gs and Qq are for 
the GP kernel function on IA and V respectively. Thus SRM defines a Bayesian 
prior p(t\Q) for the latent variables t. Let I be the index set of entity pairs 
having observed links, the marginal likelihood under such a prior is 



where Ri = r^, 6 I. The hyper-parameters G can be estimated by max- 
imizing the marginal likelihood. Then the link for a new pair of entities can 
be predicted by marginalizing over the a posteriori p(t\Hi, G). 

This model in fact defines a set of nonparametric priors on infinite dimen- 
sional tensor matrices, where each element represents a relationship between 
a tuple of entities. By maximizing the marginalized likelihood, information is 
exchanged between the participating GPs through the entire relational net- 
work, so that the dependency structure of links is messaged to the dependency 
of entities, reflected by the adapted GP kernels. Because the training is on a 
conditional model of links, this model offers a discriminative approach for link 
prediction, namely predicting the existences, strengths, or types of relation- 
ships based on the partially observed linkage network as well as the attributes 
of entities if given. Yu et al. further upgraded SRM with an edge-wise covari- 
ance with which the overall computational complexity can be reduced [114]. 
For more details one can see Refs. [102,115]. 




(ij)ei 



(40) 
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6 Applications 

The problem of link prediction has attracted much attention from disparate 
research communities. This is mainly attributed to its broad applicability. For 
some networks, especially biological networks such as protein-protein interac- 
tion networks, metabolic networks and food webs, the discovery of links or 
interactions is costly in the laboratory or the field. A highly accurate predic- 
tion can reduce the experimental costs and speed the pace of uncovering the 
truth [75,77]. Link prediction has also been applied in the analysis of social 
networks, such as the prediction of being actors in acts [116], the prediction 
of the collaborations in co-authorship networks [49], the detection of the un- 
derground relationships between terrorists [75], and so on. In addition, the 
process of recommending items to users can be considered as a link predic- 
tion problem in the user-item bipartite networks [117,118]. Actually, almost 
the same techniques as the similarity-based link prediction has been applied 
in personalized recommendation [62,119,120]. Accurate recommendation can 
be used in e-commerce web sites to enhance the sales [121]. Moreover, the 
link prediction approaches can be applied to solve the classification problem 
in partially labeled networks, such as the prediction of protein functions [35], 
the detection of anomalous email [122], distinguishing the research ares of 
scientific publications [123], and finding out the fraud and legit users in cell 
phone networks [124]. The following three subsections will introduce typical 
applications of link prediction. 



6. 1 Reconstruction of Networks 

Guimera and Sales-Pardo [18] firstly considered the reconstruction of networks 
from the observed networks with missing and spurious links. Although one can 
rank the observed and non-observed links according to their reliabilities (see 
Eq. 36), it is not easy to reconstruct the "true" network since generally no one 
knows how many missing and spurious links there are. Applying the similar 
techniques presented in Section 4.2, Guimera and Sales-Pardo [18] defined the 
reliability of a network A as 

R(A)= J] R xy = J] C(A xy = l\A°), (41) 

A xy =l,x<y A xy =l,x<y 

where R xy and L are defined in Eqs. 33 and 36, and the term A° is used 
to emphasize that the likelihoods are calculated according to the observed 
network. 
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Given A°, a straightforward idea is to find out the network A that maximizes 
the reliability defined by Eq. 41. However, the computation is too costly to 
be implemented. In practice, Guimera and Sales-Pardo [18] designed a simple 
greedy algorithm. Their algorithm starts by evaluating the link reliabilities 
for all pairs of nodes. Then, at each time step it removes the link with the 
lowest reliability and adds the link (not yet in the current network) with the 
highest reliability. This change is accepted if and only if the network reliability 
increases. If it is rejected, the link with the next lowest reliability and the not- 
yet-existent link with the next highest reliability will be the next candidate 
for swapping. The algorithm stops if it rejects five consecutive attempts to 
swap links. The observed network is set as the initialization of the algorithm, 
and it will consecutively become another network with higher reliability than 
the initial network. Guimera and Sales-Pardo [18] tested their algorithm by 
generating hypothetical observed networks A° from the true networks A T 
(the five true networks used for testing are introduced in Section 4.2). Each 
observation has a fraction of the true links removed and an identical number 
of random links added. 



Guimera and Sales-Pardo [18] compared the global network properties of the 
observed networks and these of the reconstructed networks. According to six 
metrics, clustering coefficient [511 , modularity [125], assortativity [89,90], con- 
gestabilityrM. synchronizabilit\ 10 and spreading threshold"!, the reconstruc- 
tion generally improves the estimates, indicating the validity of the approach. 
Notice that, the results from the greedy algorithm may be far different from the 
real optimum subject to the maximal reliability, thus we may expect even bet- 
ter estimates if one has developed a more effective and/or efficient algorithm. 
Readers should be warned that in both the algorithm and the preparation of 
observed networks, a latent assumption is that the number of missing links 
and the number of spurious links are equal. Since in the real systems, these 
two numbers may be very different (it is easy to image that in many net- 
works, such as metabolic networks and friendship networks, the missing links 
are much more than the spurious links) , the effectiveness of the algorithm still 
needs further validation. 



9 The congest ability refers to the maximal betweenness centrality which governs 
the transportation throughput of a network [126,127]. 

10 The synchronizability refers to the ratio between the largest and the smallest non- 
zero eigenvalues of the Laplacian matrix of a network, which quantifies the ability 
of synchronization under the framework of master stability analysis [128,129]. 

11 Ignoring the degree-degree correlations and applying the mean-field approxima- 
tion, the spreading threshold equals the ratio between the first and the second 
moments of the degree distribution [130,131]. 
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6.2 Evaluation of Network Evolving Mechanisms 



Since the groundbreaking work by Barabasi and Albert [41], the evolving 
models all the time lie in the center of the complex network study. A fun- 
damental difficulty is that for a given network or a target network property, 
there are generally many possible mechanisms and it is not easy to judge which 
one is the best. Taking the power-law degree distribution as an instance, the 
well-known mechanisms include rich gets richer [41], good gets richer [132], 
optimal design [133], Hamiltonian dynamics [134], merging and regeneration 
[135], stability constraints [136], and so on. Hence we can not easily know 
which factor(s) leads to the scale-free property of a real network, and in fact 
there can be so many models competing for the final explanation of a given 
real network. It is very hard to evaluate different models by comparing their 
resulted networks with the target network, since there are too many metrics 
for topological features [5]. As mentioned in Section 1, there are many mod- 
els about the topology of the Internet, some more accurately reproduce the 
degree distribution and the disassortative mixing pattern (e.g., see Ref. [19]) 
and some better characterize the /c-core structure (e.g., see Ref. [20]). To judge 
which model (i.e., which evolving mechanism) is better than the others is a 
tough task. 

Essentially speaking, an algorithm for link prediction makes a guess about the 
factors resulting in the existence of links, which is actually what an evolv- 
ing model wants to show. In other words, an evolving model in principle can 
be mapped to a link prediction algorithm. Therefore, we can quantitatively 
compare the accuracies of different evolving models with the help of the perfor- 
mance metrics for link prediction (see Section 2). We hope this methodology 
could provide a fair platform to compare different evolving models, which may 
be significant for the studies of network modeling. Next, we will show a real 
application about the Chinese city airline network, where each node represents 
a city with airport, and two cities are connected if there exists at least one 
direct airline between them [56]. 

It is well-known that the evolution of a city airline network is affected by not 
only the topological factor, but also the geographical factor [143] and exter- 
nal factors, such as population and economic level of a city [137]. As shown 
by Liben-Nowell et al. [49] and Zhou et al. [47], the common neighbor index 
is a good candidate to account for the topological effects. In addition, Cui 
et al. [138] developed an evolving model driven completely by the common 
neighborhood, which well reproduces not only the global network properties, 
but also the local structural features like power-law clique-degree distributions 
[139] of social and technological networks. Therefore, we simply use the com- 
mon neighbor index S CN (see Eq. 2) to represent the topological ingredient. 
Geographical distance is considered to be one of the realistic factors that affect 
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the existence of nodes' interactions in networks [140]. Especially it plays very 
important role in analyzing transportation networks [141,142]. It is known to 
be relevant to the existence of an airline, and the number of airlines decays 
with the increasing of corresponding distance [55,143]. Accordingly, we use the 
inverse of geographical distance between two cities as the similarity index, say 

s ° yIS = Dj^y (42) 



where D g (x, y) denotes the geographical distance between cities x and y. Based 
on a null assumption that people in every city have the same frequency of air 
travels, the similar index for populations is defined as 

s P ° PU = p (x) x P(y), (43) 



where P(x) is the population of city x. The economic level of a city can 
be roughly quantified by its gross domestic product (GDPlP^l and thus the 
corresponding similarity is defined as 

4 DP = >< G (y)> (44) 



where G(x) denotes the GDP of city x. Considering that the airline business 
is most tightly related to the service industry, besides the simple GDP, we 
user the third sector of GDP, named the tertiary industru 13 1 to characterize a 
city's potential to build airlines: 



T(x) x T(y), 



(45) 



where T(x) is the tertiary industry of city x. 

Since the size of the Chinese city airline network [56] is small (|V| = 121, 
\E\ = 1378), we adopt the leave-one-out method, namely at each time, we 
pick only one link for test and all other links constitute the training set. This 
procedure repeats for 1378 times with each link being once the testing link. 
Table 4 displays the prediction accuracy (AUC values) of the five similarity 

12 The GDP is a measure of a city's overall economic output. It is the market value 
of all final goods and services made within the borders of a city in a year. Here we 
use the data of the year 2005. 

13 The tertiary industry (also called tertiary sector of the economy, service sector 
or service industry) consists of the "soft" parts of the economy, namely activities 
where people offer their knowledge and time to improve productivity, performance, 
potential, and sustainability. The basic characteristic of this sector is the production 
of services instead of end products. 
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Table 4 

The prediction accuracy of the five similarity indices for the Chinese city airline 
network. The training and testing sets are divided according to the leave-one-out 
method. 

Similarity Indices AUC 

S CN 0.898 

S DIS 0.699 
s popu 0745 

S GDP 0.855 
S TI 0.881 

indices. It indicates that every factor under consideration plays a role, while 
the topological factor is most significant. The tertiary industry of a city, as an 
external factor, also plays a very important role. Actually, a linear combination 
of the common neighbor index and the tertiary industry, as S' = XS CN + (1 — 
X)S TI can achieve a very high AUC value, 0.928, at A ~ 0.2. 

Although the method introduced here is straightforward, it gives insights into 
the underlying evolving mechanisms which may not be seriously considered 
in the early studies. The validity of this method has been demonstrated by 
some recent evidences. For example, by comparing the evolving models driven 
respectively by the topological factor, the geographical factor, and the above- 
mentioned three external factors, Liu et al. [144] showed that only the one 
considering the tertiary industry can reproduce the observed double power- 
law degree distribution of the Chinese city airline network. In addition, among 
many external factors, the Granger causality tes^\ shows that the tertiary 
industry is the most significant factor in determining the passenger volume 
[146]. 

6.3 Classification of Partially Labeled Networks 

Given a network with partial nodes being labeled, the problem is to predict 
the labels of these unlabeled nodes based on the known labels and the network 
structure. Two main difficulties in achieving highly accurate classification are 
the sparsity of the known labeled nodes and the inconsistency of label infor- 
mation. To address these two difficulties, a simple but effective method is to 
add artificial connections between every pair of labeled and unlabeled nodes 
according to their similarity scores [123,147], with almost the same techniques 
used in similarity-based link prediction. An underlying assumption is that two 

14 The Granger causality test is a technique for determining whether one time series 
is useful in forecasting another. See Ref. [145] for details. 
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Fig. 5. An illustration of how to predict the fifth node's label by adding artificial 
links. 



nodes are more likely to be categorized into the same class if they are more 
similar to each other. 

Consider an unweighted undirected network of both labeled and unlabeled 
nodes: G(V, E, L), where V is the set of nodes, E is the set of links and 
L = {h, h, ■ ■ ■ , lm} is the set of labels. For the nodes without labels, we label 
them by 0. For each pair of nodes, x and y, a similarity index will assign a 
score as s xy . For an unlabeled node x, the probability it belongs to k is 



where k G L. The predicted label of node x is determined by the largest p(k\x). 
If there are more than one maximum values, we randomly select one. 

A simple example is shown in Fig. 5, where there are two kinds of labels (i.e. a 
and b) and five nodes, four of which are labeled already. Our task is to predict 
the label of the node 5. According to the common neighbors index S , we 
obtain the similarity between node 5 and the other four labeled nodes: S15 = 1, 
S25 = 1, S35 = 2 and s 45 = 0. Thus, the probabilities that node 5 belongs to 
class a and b are p(a|node5) = 0.75 and p(fe|node5) = 0.25 respectively. If 
we use RA index, the similarity scores are: S15 = |, s 25 = |, s 35 = | + \ 
and S45 = 0. Therefore, the probabilities change to p(a|node5) = 0.7 and 
p(6|node5) = 0.3. According to any of the two indices, the predicted label of 
node 5 is a. 



p(h\x) 




(46) 
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7 Outlook 



In this article, we briefly summarized the progress of studies on link prediction, 
emphasizing on the recent contributions by statistical physicists. Although link 
prediction is not a new problem in information science, traditional methods 
have not caught up the new development of network science, especially the 
new perspectives and tools resulted from the studies of complex networks. In 
our opinion, the studies of link prediction and complex networks will benefit 
each other, because the in-depth understanding of network structure can be 
used to design advanced link prediction algorithms (e.g., making use of the 
information about hierarchical organization [75] and modular structure [18] 
of real networks to better predict missing links) and the performance of a 
link prediction algorithm could give evidences about structural features [47] 
as well as the algorithms themselves can be used to improve the estimates of 
real networks' properties [18] and to evaluate the evolving network models. In 
a word, to statistical physicists, the study of link prediction is just unfolding. 

Up to now, the studies of link prediction overwhelmingly emphasize on the 
unweighted undirected networks. For directed networks, even the ternary rela- 
tions are complicated, and thus the simple common-neighbor-based similarity 
indices have to be modified to take into account the local motif structure [148]. 
Otherwise even we can predict the existence of an arc between two nodes, we 
can not determine its direction. In addition, the path-dependent similarity in- 
dices should also be extended to take into account the link direction [149] . The 
fundamental task of link prediction in weighted networks, namely to predict 
the existence of links with the help of not only the observed links but also 
their weights, has already been considered by Murata et al. [150] and Lii et al. 
[151]. The former [150] suggested that the links with higher weights are more 
important in predicting missing links, while the latter [151] indicated a com- 
pletely opposite conclusion: the weak links play a more significant role. How 
to properly exploit the information of weights to improve the prediction accu- 
racy is still an unsolved problem. A harder problem is to predict the weights 
of links, which is relevant to the traffic prediction for urban transportation 
and air transportation systems [152]. We are expecting some variants of link 
prediction algorithms can also contribute to this domain. 

A big challenge is the link prediction in multi-dimensional networks, where 
links could have different meanings. For example, a social network may consist 
of positive and negative links, respectively pointing to friends and foes [153] , or 
trusted and distrusted peers [154]. Leskovec et al. [155] proposed a method to 
predict the signs of links (positive or negative), yet the prediction of both the 
existence of a link and its sign has not been well studied. Recent development 
of social balance theory may provide useful hints [156,157,158]. 
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A more complicated kind of multi-dimensional networks is the ones consisted 
of several classes of nodes. For example, an online resource-sharing system, 



such as Del.icio.ua 15 1, can be represented by a network that consists of three 
kinds of nodes: users, URLs and tags. Different from the tripartite networks, 
nodes in the same class can also be connected, like an arc can be added from 
a user to her/his follower who has imposed her/his collections. Ignoring the 
connections within a class of nodes, the prediction of links between users and 
objects has already been investigated [159]. However, there is still nothing 
reported about the link prediction algorithms taking into account both the 
links within a class and the links between classes. 

Inspired by the success in recommender systems, we think the prediction accu- 
racy can be considerably improved by hybrid algorithms [160]. Given a specific 
target network, we can implement many individual prediction algorithms, and 
then try to select and organize them in a property way. This so-called ensem- 
ble learning method can obtain better prediction performance than could be 
obtained from any of the individual algorithms [161]. Although the scientific 
significance of such a method is not clear to us, building ensemble systems for 
link prediction could be of huge practical value. 

The algorithms' performance can be effectively enhanced by considering some 
external information, like the attributes of nodes [31]. In common sense, two 
people share more tastes and interests (and thus may of higher probability to 
be connected in a social network) if they have more common features, such 
as age, sex, job, and so on. The attribute information can be used to pre- 
dict links without considering the network structures. Thus, when the existed 
links themselves are unreliable, attribute-based methods are preferable, which 
can to some extent solve the so-called cold start problem - a big challenge 
of link prediction [162]. Besides, community structures can also help improv- 
ing prediction accuracy [163]. In social networks, since one person may play 
different roles in different communities, the prediction in one domain can be 
inspired by the information in others [164]. For example, when we predict the 
collaborations between authors, we can consider their affiliations to improve 
the accuracy. 

Most of current approaches take into account a single snapshot of a network 
to predict the missing or future links. Extensive experiments show that these 
methods well uncover whether a link exists. However, this static graph rep- 
resentation is difficult in predicting the repeated link occurrences. For exam- 
ple, it is impossible to predict whether and when two authors will collaborate 
again in co-authorship network. Addressing this problem, Huang and Lin [165] 
proposed a time-series link prediction approach considering the temporal evo- 



15 Del.icio.us is the largest social bookmarking system where a user is allowed to 
collect URLs as well as visit and impose other users' collections. 
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lutions of link occurrences, which is more appropriate for dealing with the link 
prediction problem in evolving networks, such as online social networks. An- 
other way to involve time information is inspired by the fact that older events 
are less likely to be relevant to future links than recent ones. For example, 
author's interests may change over time and thus old publications might be 
less relevant to his currents research area. Tylenda et al. [166] developed a 
graph-based link prediction method that incorporate the temporal informa- 
tion contained in evolving networks. They found that the performance can be 
improved by either time-based weighting of edges (i.e., giving the older events 
smaller weights or even neglecting them) or weighting of edges according to 
the connecting strength. However, to design effective algorithms and eventu- 
ally settle this problem, we need in-depth and comprehensive understanding 
of temporal effects on human's interests, attentions and so on, which asks for 
extensive empirical analyses. 
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